Objective

The objective of this analytical report is to help companies identify good employees who are at risk of leaving. With this information, companies can allocate their finances and resources to the areas that help retain good employees.

Analysis Process

First, we will analyze and visualize the data to get a basic understanding of the data in hand (Human Resources Analytics by Ludovic Benistant from kaggle.com). We will then check the correlations among the factors to identify and interpret the key factors that drive employees to leave.

Second, we will segment the employees using cluster analysis to observe which clusters of employees have a higher possibility of leaving.

Finally, we will bucket the employees (excluding the ones who have stayed) across two dimensions, performance and risk of leaving, in order to predict and identify the employees that companies generally wish to retain even at a higher cost: high-performing employees with a high risk of leaving (and perhaps also the low-performing employees with a low possibility of leaving). This will help the company target its investment in human resources and reduce the risk and negative impact of losing high-performing employees.

1. Data check and Visualisation

1.1 Load and Explore the data

First, let’s load the data to use.

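# Read the HR data set from the project's data folder
ProjectData <- read.csv("./data/HR_data.csv")
# Convert the data frame to a numeric matrix for the steps that follow
ProjectData <- data.matrix(ProjectData)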

Description of the data

  1. Employee satisfaction level
  2. Last evaluation
  3. Number of projects
  4. Average monthly hours
  5. Time spent at the company
  6. Whether they have had a work accident
  7. Whether they have had a promotion in the last 5 years
  8. Department
  9. Salary (1=low, 2=medium, 3=high)
  10. Whether employee has left

This is what the first 10 observations (employees) look like.

Obs.01 Obs.02 Obs.03 Obs.04 Obs.05 Obs.06 Obs.07 Obs.08 Obs.09 Obs.10
satisfaction_level 0.38 0.80 0.11 0.72 0.37 0.41 0.10 0.92 0.89 0.42
last_evaluation 0.53 0.86 0.88 0.87 0.52 0.50 0.77 0.85 1.00 0.53
number_project 2.00 5.00 7.00 5.00 2.00 2.00 6.00 5.00 5.00 2.00
average_montly_hours 157.00 262.00 272.00 223.00 159.00 153.00 247.00 259.00 224.00 142.00
time_spend_company 3.00 6.00 4.00 5.00 3.00 3.00 4.00 5.00 5.00 3.00
Work_accident 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
left 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
promotion_last_5years 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
salary_level 1.00 2.00 2.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
sales 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
accounting 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
hr 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
technical 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
support 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
management 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
IT 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
product_mng 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
marketing 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
RandD 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

The data we use here have the following descriptive statistics.

min 25 percent median mean 75 percent max std
satisfaction_level 0.09 0.44 0.64 0.61 0.82 1 0.25
last_evaluation 0.36 0.56 0.72 0.72 0.87 1 0.17
number_project 2.00 3.00 4.00 3.80 5.00 7 1.23
average_montly_hours 96.00 156.00 200.00 201.05 245.00 310 49.94
time_spend_company 2.00 3.00 3.00 3.50 4.00 10 1.46
Work_accident 0.00 0.00 0.00 0.14 0.00 1 0.35
left 0.00 0.00 0.00 0.24 0.00 1 0.43
promotion_last_5years 0.00 0.00 0.00 0.02 0.00 1 0.14
salary_level 1.00 1.00 2.00 1.59 2.00 3 0.64
sales 0.00 0.00 0.00 0.28 1.00 1 0.45
accounting 0.00 0.00 0.00 0.05 0.00 1 0.22
hr 0.00 0.00 0.00 0.05 0.00 1 0.22
technical 0.00 0.00 0.00 0.18 0.00 1 0.39
support 0.00 0.00 0.00 0.15 0.00 1 0.36
management 0.00 0.00 0.00 0.04 0.00 1 0.20
IT 0.00 0.00 0.00 0.08 0.00 1 0.27
product_mng 0.00 0.00 0.00 0.06 0.00 1 0.24
marketing 0.00 0.00 0.00 0.06 0.00 1 0.23
RandD 0.00 0.00 0.00 0.05 0.00 1 0.22
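
A summary table of this shape can be computed column by column. The sketch below is only an illustration: it uses a small made-up matrix `toy` in place of ProjectData, and the helper name `describe_columns` is hypothetical and not part of the original analysis.

```r
# Toy stand-in for ProjectData (values are made up for illustration)
toy <- cbind(
  satisfaction_level = c(0.38, 0.80, 0.11, 0.72, 0.37),
  number_project     = c(2, 5, 7, 5, 2)
)

# Hypothetical helper: one row of summary statistics per column
describe_columns <- function(m) {
  stats <- apply(m, 2, function(x) {
    c(min = min(x),
      `25 percent` = unname(quantile(x, 0.25)),
      median = median(x),
      mean = mean(x),
      `75 percent` = unname(quantile(x, 0.75)),
      max = max(x),
      std = sd(x))
  })
  # apply() returns statistics in rows, so transpose to one row per variable
  round(t(stats), 2)
}

summary_table <- describe_columns(toy)
print(summary_table)
```

Each row of `summary_table` corresponds to one variable and each column to one statistic, which mirrors the layout of the table above.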

1.2 Scale the data

Here, we rescale the data so that the results are not driven by a few variables with relatively large values. We use min-max scaling, which maps each variable to the range between 0 and 1.

# Min-max scale each column (MARGIN = 2) to the range [0, 1]
ProjectDataFactor_scaled <- apply(ProjectDataFactor, 2, function(r) {
    (r - min(r)) / (max(r) - min(r))
})

Below are the summary statistics of the scaled dataset.

min 25 percent median mean 75 percent max std
satisfaction_level 0 0.38 0.60 0.57 0.80 1 0.27
last_evaluation 0 0.31 0.56 0.56 0.80 1 0.27
number_project 0 0.20 0.40 0.36 0.60 1 0.25
average_montly_hours 0 0.28 0.49 0.49 0.70 1 0.23
time_spend_company 0 0.12 0.12 0.19 0.25 1 0.18
Work_accident 0 0.00 0.00 0.14 0.00 1 0.35
left 0 0.00 0.00 0.24 0.00 1 0.43
promotion_last_5years 0 0.00 0.00 0.02 0.00 1 0.14
salary_level 0 0.00 0.50 0.30 0.50 1 0.32
sales 0 0.00 0.00 0.28 1.00 1 0.45
accounting 0 0.00 0.00 0.05 0.00 1 0.22
hr 0 0.00 0.00 0.05 0.00 1 0.22
technical 0 0.00 0.00 0.18 0.00 1 0.39
support 0 0.00 0.00 0.15 0.00 1 0.36
management 0 0.00 0.00 0.04 0.00 1 0.20
IT 0 0.00 0.00 0.08 0.00 1 0.27
product_mng 0 0.00 0.00 0.06 0.00 1 0.24
marketing 0 0.00 0.00 0.06 0.00 1 0.23
RandD 0 0.00 0.00 0.05 0.00 1 0.22

1.3 Check Correlations

The simplest way to take a first look at a dataset is to check the correlations. This shows which factors have a strong positive or negative correlation with employees leaving. Correlation is not causation, so we cannot conclude that a highly correlated factor (independent variable) causes an employee to leave (dependent variable). Also, if some of the factors (independent variables) are highly correlated with each other, we could consider grouping these attributes together.

satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years salary_level sales accounting hr technical support management IT product_mng marketing RandD
satisfaction_level 1.00 0.11 -0.14 -0.02 -0.10 0.06 -0.39 0.03 0.05 0.00 -0.03 -0.01 -0.01 0.01 0.01 0.01 0.01 0.01 0.01
last_evaluation 0.11 1.00 0.35 0.34 0.13 -0.01 0.01 -0.01 -0.01 -0.02 0.00 -0.01 0.01 0.02 0.01 0.00 0.00 0.00 -0.01
number_project -0.14 0.35 1.00 0.42 0.20 0.00 0.02 -0.01 0.00 -0.01 0.00 -0.03 0.03 0.00 0.01 0.00 0.00 -0.02 0.01
average_montly_hours -0.02 0.34 0.42 1.00 0.13 -0.01 0.07 0.00 0.00 0.00 0.00 -0.01 0.01 0.00 0.00 0.01 -0.01 -0.01 0.00
time_spend_company -0.10 0.13 0.20 0.13 1.00 0.00 0.14 0.07 0.05 0.02 0.00 -0.02 -0.03 -0.03 0.12 -0.01 0.00 0.01 -0.02
Work_accident 0.06 -0.01 0.00 -0.01 0.00 1.00 -0.15 0.04 0.01 0.00 -0.01 -0.02 -0.01 0.01 0.01 -0.01 0.00 0.01 0.02
left -0.39 0.01 0.02 0.07 0.14 -0.15 1.00 -0.06 -0.16 0.01 0.02 0.03 0.02 0.01 -0.05 -0.01 -0.01 0.00 -0.05
promotion_last_5years 0.03 -0.01 -0.01 0.00 0.07 0.04 -0.06 1.00 0.10 0.01 0.00 0.00 -0.04 -0.04 0.13 -0.04 -0.04 0.05 0.02
salary_level 0.05 -0.01 0.00 0.00 0.05 0.01 -0.16 0.10 1.00 -0.04 0.01 0.00 -0.02 -0.03 0.16 -0.01 -0.01 0.01 0.00
sales 0.00 -0.02 -0.01 0.00 0.02 0.00 0.01 0.01 -0.04 1.00 -0.14 -0.14 -0.29 -0.26 -0.13 -0.18 -0.16 -0.15 -0.15
accounting -0.03 0.00 0.00 0.00 0.00 -0.01 0.02 0.00 0.01 -0.14 1.00 -0.05 -0.11 -0.10 -0.05 -0.07 -0.06 -0.06 -0.05
hr -0.01 -0.01 -0.03 -0.01 -0.02 -0.02 0.03 0.00 0.00 -0.14 -0.05 1.00 -0.11 -0.10 -0.05 -0.07 -0.06 -0.06 -0.05
technical -0.01 0.01 0.03 0.01 -0.03 -0.01 0.02 -0.04 -0.02 -0.29 -0.11 -0.11 1.00 -0.20 -0.10 -0.14 -0.12 -0.12 -0.11
support 0.01 0.02 0.00 0.00 -0.03 0.01 0.01 -0.04 -0.03 -0.26 -0.10 -0.10 -0.20 1.00 -0.09 -0.12 -0.11 -0.10 -0.10
management 0.01 0.01 0.01 0.00 0.12 0.01 -0.05 0.13 0.16 -0.13 -0.05 -0.05 -0.10 -0.09 1.00 -0.06 -0.05 -0.05 -0.05
IT 0.01 0.00 0.00 0.01 -0.01 -0.01 -0.01 -0.04 -0.01 -0.18 -0.07 -0.07 -0.14 -0.12 -0.06 1.00 -0.08 -0.07 -0.07
product_mng 0.01 0.00 0.00 -0.01 0.00 0.00 -0.01 -0.04 -0.01 -0.16 -0.06 -0.06 -0.12 -0.11 -0.05 -0.08 1.00 -0.06 -0.06
marketing 0.01 0.00 -0.02 -0.01 0.01 0.01 0.00 0.05 0.01 -0.15 -0.06 -0.06 -0.12 -0.10 -0.05 -0.07 -0.06 1.00 -0.06
RandD 0.01 -0.01 0.01 0.00 -0.02 0.02 -0.05 0.02 0.00 -0.15 -0.05 -0.05 -0.11 -0.10 -0.05 -0.07 -0.06 -0.06 1.00

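Satisfaction level has the strongest correlation with employees leaving, and it is negative (-0.39). The next strongest are salary level (-0.16), work accident (-0.15) and time spent at the company (0.14).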
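
Reading one column out of a large correlation matrix is easier when it is sorted. The following is a minimal sketch under stated assumptions: the data are simulated (not the HR data set), and the object names `toy`, `cor_matrix` and `cor_with_left` are introduced here only for illustration.

```r
set.seed(1)
n <- 200

# Made-up data: lower satisfaction makes leaving more likely
satisfaction_level <- runif(n)
left <- rbinom(n, 1, prob = 0.6 - 0.5 * satisfaction_level)
average_montly_hours <- round(runif(n, 96, 310))
toy <- cbind(satisfaction_level, average_montly_hours, left)

# Full correlation matrix, rounded to two decimals as in the table above
cor_matrix <- round(cor(toy), 2)

# Correlation of every other variable with 'left', strongest first
cor_with_left <- cor_matrix[rownames(cor_matrix) != "left", "left"]
cor_with_left <- cor_with_left[order(abs(cor_with_left), decreasing = TRUE)]
print(cor_with_left)
```

Sorting by absolute value keeps strong negative correlations, such as the one for satisfaction level, at the top of the list.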

2. Cluster Analysis and Segmentation

2.1 (1st try) Select segmentation variables and methods

We use all the variables except “Whether the employee has left” (column 7), and we measure dissimilarity with Euclidean distance.

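# Columns used to form the segments (all except column 7, "left")
segmentation_attributes_used = c(1:6, 8:19)
# Columns used to profile the resulting segments
profile_attributes_used = c(1:19)
numb_clusters_used = 5          # number of segments to extract
profile_with = "hclust"         # profile the hierarchical clustering solution
distance_used = "euclidean"     # distance metric
hclust_method = "ward.D"        # linkage method for hierarchical clustering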

Here are the pairwise distances between the first 10 observations using the distance metric we selected:

Obs.01 Obs.02 Obs.03 Obs.04 Obs.05 Obs.06 Obs.07 Obs.08 Obs.09 Obs.10
Obs.01 0.00
Obs.02 1.21 0.00
Obs.03 1.39 0.89 0.00
Obs.04 0.97 0.55 0.96 0.00
Obs.05 0.02 1.22 1.39 0.98 0.00
Obs.06 0.06 1.23 1.43 0.99 0.06 0.00
Obs.07 1.03 0.98 0.58 0.75 1.03 1.07 0.00
Obs.08 1.12 0.53 1.11 0.28 1.13 1.13 0.94 0.00
Obs.09 1.17 0.60 1.12 0.28 1.18 1.19 0.97 0.29 0.00
Obs.10 0.08 1.23 1.43 0.98 0.10 0.07 1.08 1.13 1.17 0

2.2 (1st try) Visualize Pair-wise Distances

We can look at the histogram of, say, the first 2 variables, or at the histogram of all pairwise distances for the Euclidean distance.

2.3 (1st try) Number of Segments

Let’s use hierarchical clustering. It may be useful to look at the dendrogram to get a quick idea of how the data may be segmented and how many segments there may be. Here is the dendrogram for our data:

We can also plot the “distances” traveled before we need to merge any of the smaller clusters into larger ones, that is, the heights of the tree branches that link the clusters as we traverse the tree from its leaves to its root. With n observations this plot has n-1 values; we show the first 20 here.
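
As a rough sketch of this step, the code below builds a dendrogram and extracts the merge heights with base R, using the settings listed earlier (Euclidean distance, "ward.D"). The data are made up, and reversing the heights so that the largest merges come first is an assumption about how the plot is ordered.

```r
set.seed(42)

# Made-up, already scaled data: two loose groups of 15 observations each
toy_scaled <- rbind(
  matrix(runif(30, 0.0, 0.3), ncol = 2),
  matrix(runif(30, 0.7, 1.0), ncol = 2)
)

# Euclidean distances, then agglomerative clustering with the "ward.D" method
pairwise_dist <- dist(toy_scaled, method = "euclidean")
hc <- hclust(pairwise_dist, method = "ward.D")

# Draw the dendrogram only when running interactively
if (interactive()) plot(hc, main = "Dendrogram", xlab = "", sub = "")

# Merge heights listed from the root towards the leaves:
# n - 1 values for n observations
merge_heights <- rev(hc$height)
print(round(head(merge_heights, 20), 2))
```

A large drop between consecutive heights suggests a natural number of segments. Note that, according to the R documentation, "ward.D" matches the Ward criterion only when it is given squared distances, whereas "ward.D2" squares them internally, so the choice affects the tree.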

For now, let’s consider the 5-segment solution. We can also see the segment each observation (an employee in this case) belongs to for the first 20 employees:

Observation Number Cluster_Membership
1 1
2 1
3 1
4 1
5 1
6 1
7 1
8 1
9 1
10 1
11 1
12 1
13 1
14 1
15 1
16 1
17 1
18 1
19 2
20 1

2.5 (1st try) Profile and interpret the segments

Having decided how many clusters to use, we would like to get a better understanding of who the employees in those clusters are and interpret the segments.

The average values of our data for the total population as well as within each employee segment are:

Population Segment 1 Segment 2 Segment 3 Segment 4 Segment 5
satisfaction_level 0.57 0.57 0.57 0.57 0.58 0.58
last_evaluation 0.56 0.55 0.56 0.56 0.57 0.56
number_project 0.36 0.36 0.36 0.38 0.36 0.36
average_montly_hours 0.49 0.49 0.49 0.50 0.49 0.50
time_spend_company 0.19 0.19 0.20 0.18 0.17 0.18
Work_accident 0.14 0.14 0.15 0.14 0.15 0.13
left 0.24 0.25 0.22 0.26 0.25 0.22
promotion_last_5years 0.02 0.00 0.07 0.00 0.00 0.00
salary_level 0.30 0.27 0.34 0.28 0.27 0.29
sales 0.28 1.00 0.02 0.00 0.00 0.00
accounting 0.05 0.00 0.16 0.00 0.00 0.00
hr 0.05 0.00 0.15 0.00 0.00 0.00
technical 0.18 0.00 0.01 1.00 0.00 0.00
support 0.15 0.00 0.00 0.00 1.00 0.00
management 0.04 0.00 0.13 0.00 0.00 0.00
IT 0.08 0.00 0.00 0.00 0.00 1.00
product_mng 0.06 0.00 0.19 0.00 0.00 0.00
marketing 0.06 0.00 0.18 0.00 0.00 0.00
RandD 0.05 0.00 0.16 0.00 0.00 0.00

We can also measure the ratio of the average of each cluster to the average of the population and subtract 1 (i.e. avg(cluster) / avg(population) - 1), which gives a matrix like the following:

Segment 1 Segment 2 Segment 3 Segment 4 Segment 5
satisfaction_level 0.00 0.00 -0.01 0.01 0.01
last_evaluation -0.02 0.00 0.01 0.02 0.00
number_project -0.02 -0.01 0.04 0.00 0.01
average_montly_hours 0.00 -0.01 0.01 0.00 0.01
time_spend_company 0.02 0.05 -0.06 -0.07 -0.02
Work_accident -0.04 0.04 -0.03 0.07 -0.08
left 0.05 -0.09 0.08 0.05 -0.07
promotion_last_5years -1.00 2.08 -1.00 -1.00 -0.89
salary_level -0.08 0.13 -0.04 -0.08 -0.04
sales 2.62 -0.93 -1.00 -1.00 -1.00
accounting -1.00 2.10 -1.00 -1.00 -1.00
hr -1.00 2.10 -1.00 -1.00 -1.00
technical -1.00 -0.97 4.51 -1.00 -1.00
support -1.00 -0.97 -1.00 5.73 -1.00
management -1.00 2.10 -1.00 -1.00 -1.00
IT -1.00 -1.00 -1.00 -1.00 11.22
product_mng -1.00 2.10 -1.00 -1.00 -1.00
marketing -1.00 2.10 -1.00 -1.00 -1.00
RandD -1.00 2.10 -1.00 -1.00 -1.00

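The segment profiles depend too heavily on department: each of Segments 1, 3, 4 and 5 consists of a single department (sales, technical, support and IT respectively), while most of the other attributes stay close to the population average.

Let’s try the analysis again, excluding the department information.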

2.1 (2nd try) Select segmentation variables and methods

We use all the variables except “Whether the employee has left” and the department indicators. Note that the index vector below also omits promotion_last_5years (column 8). We use Euclidean distance.

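# Columns used to form the segments (columns 1 to 6 and salary_level)
segmentation_attributes_used = c(1:6, 9)
# Columns used to profile the resulting segments
profile_attributes_used = c(1:9)
numb_clusters_used = 5          # number of segments to extract
profile_with = "hclust"         # profile the hierarchical clustering solution
distance_used = "euclidean"     # distance metric
hclust_method = "ward.D"        # linkage method for hierarchical clustering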

Here are the pairwise distances between the first 10 observations using the distance metric we selected:

Obs.01 Obs.02 Obs.03 Obs.04 Obs.05 Obs.06 Obs.07 Obs.08 Obs.09 Obs.10
Obs.01 0.00
Obs.02 1.21 0.00
Obs.03 1.39 0.89 0.00
Obs.04 0.97 0.55 0.96 0.00
Obs.05 0.02 1.22 1.39 0.98 0.00
Obs.06 0.06 1.23 1.43 0.99 0.06 0.00
Obs.07 1.03 0.98 0.58 0.75 1.03 1.07 0.00
Obs.08 1.12 0.53 1.11 0.28 1.13 1.13 0.94 0.00
Obs.09 1.17 0.60 1.12 0.28 1.18 1.19 0.97 0.29 0.00
Obs.10 0.08 1.23 1.43 0.98 0.10 0.07 1.08 1.13 1.17 0
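
The distance tables shown above are lower triangles of a pairwise distance matrix. The sketch below shows how such a matrix can be obtained with the base R function dist(), together with the same calculation by hand for one pair. The three observations and their values are made up for illustration.

```r
# Three observations on three scaled attributes (made-up values)
toy_scaled <- rbind(
  Obs.01 = c(0.32, 0.27, 0.00),
  Obs.02 = c(0.78, 0.78, 0.60),
  Obs.03 = c(0.02, 0.81, 1.00)
)

# Euclidean distance between every pair of rows (lower triangle only)
pairwise_dist <- dist(toy_scaled, method = "euclidean")
print(round(pairwise_dist, 2))

# The same number by hand for Obs.01 versus Obs.02:
# square root of the sum of squared attribute differences
by_hand <- sqrt(sum((toy_scaled["Obs.01", ] - toy_scaled["Obs.02", ])^2))
print(by_hand)
```

Because every attribute was scaled to the range between 0 and 1 beforehand, no single attribute can dominate the distance simply because of its units.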

2.2 (2nd try) Visualize Pair-wise Distances

We can look at the histogram of, say, the first 2 variables, or at the histogram of all pairwise distances for the Euclidean distance.

2.3 (2nd try) Number of Segments

Let’s use hierarchical clustering. It may be useful to look at the dendrogram to get a quick idea of how the data may be segmented and how many segments there may be. Here is the dendrogram for our data:

We can also plot the “distances” traveled before we need to merge any of the smaller clusters into larger ones, that is, the heights of the tree branches that link the clusters as we traverse the tree from its leaves to its root. With n observations this plot has n-1 values; we show the first 20 here.

For now, let’s consider the 5-segment solution. We can also see the segment each observation (an employee in this case) belongs to for the first 20 employees:

Observation Number Cluster_Membership
1 1
2 2
3 3
4 4
5 1
6 1
7 3
8 4
9 4
10 1
11 1
12 3
13 4
14 1
15 1
16 1
17 1
18 4
19 5
20 4
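
A membership table like the one above can be obtained by cutting the dendrogram at a chosen number of segments. This is a minimal sketch on made-up data, assuming the tree was built with hclust(); the object names are illustrative only.

```r
set.seed(7)

# Made-up scaled data with 20 observations and 3 attributes
toy_scaled <- matrix(runif(60), ncol = 3)

# Hierarchical clustering on Euclidean distances
hc <- hclust(dist(toy_scaled, method = "euclidean"), method = "ward.D")

# Cut the tree into 5 segments: one integer label per observation
cluster_memberships <- cutree(hc, k = 5)

membership_table <- data.frame(
  Observation_Number = seq_along(cluster_memberships),
  Cluster_Membership = cluster_memberships
)
print(head(membership_table, 20))
```

cutree() numbers the segments in the order in which they first appear in the data, so the first observation always belongs to segment 1. The labels therefore carry no ranking.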

2.5 (2nd try) Profile and interpret the segments

Having decided how many clusters to use, we would like to get a better understanding of who the employees in those clusters are and interpret the segments.

The average values of our data for the total population as well as within each employee segment are:

Population Segment 1 Segment 2 Segment 3 Segment 4 Segment 5
satisfaction_level 0.57 0.36 0.70 0.08 0.69 0.61
last_evaluation 0.56 0.25 0.59 0.70 0.60 0.55
number_project 0.36 0.04 0.36 0.70 0.36 0.36
average_montly_hours 0.49 0.24 0.50 0.69 0.51 0.49
time_spend_company 0.19 0.13 0.20 0.28 0.16 0.19
Work_accident 0.14 0.00 0.00 0.00 0.00 1.00
left 0.24 0.78 0.09 0.55 0.15 0.08
promotion_last_5years 0.02 0.01 0.04 0.01 0.01 0.04
salary_level 0.30 0.25 0.58 0.27 0.00 0.30

We can also measure the ratio of the average of each cluster to the average of the population and subtract 1 (i.e. avg(cluster) / avg(population) - 1), which gives a matrix like the following:

Segment 1 Segment 2 Segment 3 Segment 4 Segment 5
satisfaction_level -0.38 0.22 -0.86 0.20 0.07
last_evaluation -0.55 0.05 0.25 0.07 -0.01
number_project -0.90 0.00 0.94 0.01 -0.01
average_montly_hours -0.51 0.03 0.40 0.03 -0.01
time_spend_company -0.31 0.09 0.48 -0.16 0.01
Work_accident -1.00 -1.00 -1.00 -1.00 5.92
left 2.29 -0.64 1.33 -0.38 -0.67
promotion_last_5years -0.37 0.69 -0.72 -0.69 0.65
salary_level -0.16 0.94 -0.08 -1.00 0.02
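
The two profile tables above (segment averages and their relative deviation from the population) can be sketched as follows. The data, the segment labels and the object names are made up for illustration and are not taken from the HR data set.

```r
# Made-up scaled data: 6 employees, 2 attributes, 2 segments
toy_scaled <- cbind(
  satisfaction_level = c(0.2, 0.4, 0.3, 0.8, 0.9, 0.7),
  number_project     = c(0.6, 0.8, 1.0, 0.2, 0.4, 0.0)
)
cluster_memberships <- c(1, 1, 1, 2, 2, 2)
segments <- sort(unique(cluster_memberships))

# Mean of each attribute over everyone and within each segment
population_avg <- colMeans(toy_scaled)
segment_avg <- sapply(segments, function(k) {
  colMeans(toy_scaled[cluster_memberships == k, , drop = FALSE])
})
colnames(segment_avg) <- paste("Segment", segments)

# Relative deviation from the population: avg(cluster) / avg(population) - 1
# (each column of segment_avg is divided element-wise by population_avg)
relative_profile <- round(segment_avg / population_avg - 1, 2)
print(relative_profile)
```

A value of 0 means the segment matches the population average, and a value of -1.00 means the segment average is zero, as for Work_accident in Segments 1 to 4 above. The ratio is undefined for an attribute whose population average is zero.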

2.6 Robustness analysis

3. Drivers of Leaving Company

3.1 Classification tree

3.2 Profit curve

4. Business Decisions

Following the analysis above, several business decisions can be made.

First, companies can implement policies to control attrition rates by managing the variables that have a high correlation with employees leaving, such as satisfaction level.

Conclusion